This report provides an evaluation of the accuracy and precision of probabilistic nowcasts and forecasts of the weekly number of confirmed influenza hospital admissions submitted to the FluSight Hub. Some analyses include forecasts with reference dates falling within the last 24 weeks, starting on October 14, 2023. Others focus on evaluating “recent” forecasts, with reference dates falling within the last 4 weeks, starting on March 2, 2024.
The US Centers for Disease Control and Prevention (CDC) collects short-term forecasts from dozens of research groups around the globe. Every week, the CDC combines the most recent forecasts from each team into a single “ensemble” forecast for each target. This is used as the official ensemble forecast of the CDC, typically appearing on its forecasting website on Friday.
This report evaluates state-level forecasts of the weekly number of confirmed influenza hospital admissions at 0- to 3-week horizons, using methods similar to those employed for the COVID-19 Evaluation Reports. Data published by the CDC on healthdata.gov (details here) are used as ground truth for evaluating the forecasts.
We evaluate models based on their adjusted relative weighted interval scores (WIS, a measure of distributional accuracy) and adjusted relative mean absolute error (MAE). Scores are aggregated separately for the most recent 4 weeks and for the entire 2023-2024 season. Because some weeks and locations are harder to forecast than others, and because teams submit forecasts for different subsets of weeks, locations, and horizons, a pairwise comparison approach was used to calculate the adjusted relative WIS and MAE. Models with relative scores lower than 1 have been more accurate than the baseline on average, whereas relative scores greater than 1 indicate lower accuracy than the baseline on average.
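To make the pairwise approach concrete, the sketch below computes a relative WIS as the geometric mean of a model’s head-to-head ratios of mean WIS on shared forecast tasks, rescaled so the baseline scores exactly 1. This is a minimal illustration, not the report’s exact implementation: the long-format score table with columns model, location, reference_date, horizon, and wis, and the baseline name "FluSight-baseline", are assumptions.

```python
import itertools

import numpy as np
import pandas as pd


def pairwise_relative_wis(scores: pd.DataFrame,
                          baseline: str = "FluSight-baseline") -> pd.Series:
    """Relative WIS via pairwise comparisons (illustrative sketch).

    Assumes one row per scored forecast, with columns: model, location,
    reference_date, horizon, wis. The baseline model name is also an
    assumption and must be present in the table.
    """
    keys = ["location", "reference_date", "horizon"]
    models = list(scores["model"].unique())
    by_model = {m: g.set_index(keys)["wis"] for m, g in scores.groupby("model")}

    # theta[(m1, m2)]: ratio of mean WIS on the forecast tasks both models scored
    theta = {}
    for m1, m2 in itertools.permutations(models, 2):
        shared = by_model[m1].index.intersection(by_model[m2].index)
        if len(shared) > 0:
            theta[(m1, m2)] = (by_model[m1].loc[shared].mean()
                               / by_model[m2].loc[shared].mean())

    # Relative skill: geometric mean of each model's pairwise ratios.
    skill = {
        m: np.exp(np.mean([np.log(theta[(m, other)]) for other in models
                           if other != m and (m, other) in theta]))
        for m in models
    }
    # Rescale so the baseline sits at exactly 1; values below 1 beat it.
    return pd.Series(skill) / skill[baseline]
```

Running the same computation with an MAE column in place of the WIS column yields the corresponding relative MAE.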
We generated scores in two ways: with raw counts and with log-transformed counts. It has been argued that log-transformation prior to scoring yields epidemiologically meaningful and easily interpretable results, while also reducing the impact of high-count locations on aggregated scores (Bosse et al. 2023).
These evaluations are based on raw counts.
These tables evaluate forecasts from the four most recent weeks, as well as historical accuracy for all forecasts submitted in the current season. The first two tables evaluate forecasts based on their WIS and MAE, overall and by horizon. The last two tables evaluate prediction interval coverage rates, overall and by horizon.
Results are reported for “all locations” combined (an average across all 50 states, plus the District of Columbia and Puerto Rico), as well as for each location separately. Results for “all locations” are shown by default. Use the dropdown menu to find results for a specific location.
This table only includes forecasts with reference dates falling within the last 4 weeks, since March 2, 2024. Additionally, models included for ‘All locations’ have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. Models included for an individual location have submitted at least 50% of forecasts for that location during this time.
The data are initially ordered, for all locations, by each model’s relative WIS aggregated across horizons, with the most accurate models at the top.
This table includes forecasts with reference dates falling within the last 24 weeks, since October 14, 2023. Additionally, models included for ‘All locations’ have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. Models included for an individual location have submitted at least 50% of forecasts for that location during this time.
The data are initially ordered, for all locations, by each model’s relative WIS aggregated across horizons, with the most accurate models at the top.
This table only includes forecasts with reference dates falling within the last 4 weeks, since March 2, 2024. Additionally, models included for ‘All locations’ have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. Models included for an individual location have submitted at least 50% of forecasts for that location during this time.
The data are initially ordered, for all locations, by each model’s 95% PI coverage, with the models whose empirical coverage rates are closest to 95% at the top.
This table includes forecasts with reference dates falling within the last 24 weeks, since October 14, 2023. Additionally, models included for ‘All locations’ have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. Models included for an individual location have submitted at least 50% of forecasts for that location during this time.
The data are initially ordered, for all locations, by each model’s 95% PI coverage, with the models whose empirical coverage rates are closest to 95% at the top.
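As a concrete reference for how an empirical coverage rate can be computed, here is a minimal Python sketch. The column names (model, lower, upper, observed) are assumptions; lower and upper are taken to be the quantiles bounding the central 95% interval (i.e. the 2.5% and 97.5% predictive quantiles).

```python
import pandas as pd


def empirical_coverage_95(quantile_forecasts: pd.DataFrame) -> pd.Series:
    """Share of forecasts, per model, whose central 95% prediction
    interval contained the observed admissions count.

    Assumes one row per forecast with columns: model, lower, upper,
    observed, where lower/upper are the 2.5% and 97.5% quantiles.
    """
    covered = ((quantile_forecasts["observed"] >= quantile_forecasts["lower"])
               & (quantile_forecasts["observed"] <= quantile_forecasts["upper"]))
    return covered.groupby(quantile_forecasts["model"]).mean()
```

A well-calibrated model should have an empirical rate near 0.95; substituting other interval levels gives the coverage columns for the narrower intervals.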
The data in this graph have been aggregated over all locations and submission weeks. This figure only includes forecasts with reference dates falling within the last 4 weeks, since March 2, 2024. Additionally, models included have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. This is the same inclusion criterion applied for WIS scores in the recent evaluation period.
For each model, the bars stack to its overall (unadjusted) WIS. Of note, these values may not exactly match the relative WIS scores shown in the leaderboard table, because they are not adjusted for missing weeks or locations. The data are ordered on the x axis based on the relative WIS scores shown in the accuracy table, aggregated across horizons. The y axis is truncated at the 95th percentile of the sum of the bars across models, rounded up to the nearest 10.
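The three bar components correspond to the standard decomposition of the quantile-based WIS into dispersion, overprediction, and underprediction, which sum exactly to the WIS. A minimal Python sketch for a single forecast, assuming the predictive median and the bounds of each central interval are available as arrays (the names are illustrative):

```python
import numpy as np


def wis_components(y: float, median: float, lower: np.ndarray,
                   upper: np.ndarray, alphas: np.ndarray) -> dict:
    """Decompose one forecast's WIS into the three stacked components.

    lower[k] and upper[k] bound the central (1 - alphas[k]) prediction
    interval (e.g. alphas[k] = 0.1 for the 90% PI). Uses the standard
    quantile-based WIS with weight alpha_k / 2 per interval and 1/2 on
    the median term.
    """
    norm = len(alphas) + 0.5
    # Dispersion: weighted widths of the prediction intervals.
    dispersion = float(np.sum(alphas / 2 * (upper - lower))) / norm
    # Overprediction: penalties when the forecast sits above the observation.
    overprediction = (0.5 * max(median - y, 0.0)
                      + float(np.sum(np.maximum(lower - y, 0.0)))) / norm
    # Underprediction: penalties when the forecast sits below the observation.
    underprediction = (0.5 * max(y - median, 0.0)
                       + float(np.sum(np.maximum(y - upper, 0.0)))) / norm
    return {"dispersion": dispersion,
            "overprediction": overprediction,
            "underprediction": underprediction,
            "wis": dispersion + overprediction + underprediction}
```

Intuitively, a model pays for wide intervals through the dispersion term and for intervals that miss the observation through the two penalty terms.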
In the following figures, we evaluate models across multiple forecasting weeks. Only models that submitted probabilistic forecasts for all 50 states are included. In the legend, models shown with a dot and line have scores for every week, while models shown with just a line are missing scores for at least one week.
For the figures, WIS is used as a metric, with the y axis truncated at the 97.5th percentile of the weekly average WIS. The first figure shows the mean WIS across all 50 states for submission weeks beginning October 14, 2023 at a 0-week horizon. The next 3 figures show the mean WIS aggregated across locations for 1-, 2- and 3-week horizons. The last 4 figures show the empirical 95% PI coverage aggregated across locations for all horizons.
In this figure, the models with dashed lines are not included in the FluSight ensemble.
In this figure, the models with dashed lines are not included in the FluSight ensemble.
In this figure, the dotted black line represents the average 1 week ahead error across all models, as a “point of reference”. This shows that the scale of errors increases with larger horizons. The models with dashed lines are not included in the FluSight ensemble.
In this figure, the dotted black line represents the average 1 week ahead error across all models, as a “point of reference”. This shows that the scale of errors increases with larger horizons. The models with dashed lines are not included in the FluSight ensemble.
We would expect a well-calibrated model to have a value of 95% in this plot. In this figure, the models with dashed lines are not included in the FluSight ensemble.
We would expect a well-calibrated model to have a value of 95% in this plot. There is typically larger error for the larger horizons compared to the 0 week horizon. In this figure, the models with dashed lines are not included in the FluSight ensemble.
We would expect a well-calibrated model to have a value of 95% in this plot. There is typically larger error for the larger horizons compared to the 0 week horizon. In this figure, the models with dashed lines are not included in the FluSight ensemble.
We would expect a well-calibrated model to have a value of 95% in this plot. There is typically larger error for the larger horizons compared to the 0 week horizon. In this figure, the models with dashed lines are not included in the FluSight ensemble.
The figures below show recent model performance stratified by location. We only included forecasts from the last 4 weeks. Models were included if they submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. Locations are sorted by cumulative hospitalization counts.
The color scheme shows the WIS relative to the baseline, across all horizons. The only locations evaluated are the 50 states, selected jurisdictions, and the national level. The data are ordered on the x axis based on the relative WIS scores shown in the accuracy table, aggregated across horizons.
This figure shows the weekly number of confirmed influenza hospital admissions reported in the US. The vertical blue line indicates the beginning of the “recent” model evaluation period; “recent forecasts” refers to forecasts with reference dates falling within this period. The vertical green line indicates the beginning of the “seasonal” model evaluation period. For this report, “historical forecasts” refers to forecasts with reference dates falling within this period.
These evaluations are based on log-transformed counts, as recommended by Bosse et al. (2023).
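Concretely, scoring on the log scale only changes the preprocessing step: both the predictive quantiles and the observed counts are transformed, and WIS and MAE are then computed exactly as for raw counts. A minimal sketch, assuming the log(x + 1) variant discussed by Bosse et al. (2023):

```python
import numpy as np


def to_log_scale(counts, offset: float = 1.0) -> np.ndarray:
    """Natural-log transform with an offset, so zero counts stay finite.

    Applied to both predictive quantiles and observed counts before
    scoring; the offset of 1 follows the log(x + 1) approach discussed
    by Bosse et al. (2023).
    """
    return np.log(np.asarray(counts, dtype=float) + offset)
```

Because the transformation is monotone, predictive quantiles can be transformed directly and remain valid quantiles on the log scale.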
These tables evaluate forecasts from the four most recent weeks, as well as historical accuracy for all forecasts submitted in the current season, based on log-transformed counts. The tables evaluate forecasts based on their WIS and MAE, overall and by horizon.
Results are reported for “all locations” combined (an average across all 50 states, plus the District of Columbia and Puerto Rico), as well as for each location separately. Results for “all locations” are shown by default. Use the dropdown menu to find results for a specific location.
This table only includes forecasts with reference dates falling within the last 4 weeks, since March 2, 2024. Additionally, models included for ‘All locations’ have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. Models included for an individual location have submitted at least 50% of forecasts for that location during this time.
The data are initially ordered by each model’s relative WIS aggregated across horizons, with the most accurate models at the top.
This table includes forecasts with reference dates falling within the last 24 weeks, since October 14, 2023. Additionally, models included for ‘All locations’ have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. Models included for an individual location have submitted at least 50% of forecasts for that location during this time.
The data are initially ordered by each model’s relative WIS aggregated across horizons, with the most accurate models at the top.
The data in this graph have been aggregated over all locations and submission weeks. This figure only includes forecasts with reference dates falling within the last 4 weeks, since March 2, 2024. Additionally, models included have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. This is the same inclusion criterion applied for WIS scores in the recent evaluation period.
For each model, the bars stack to its overall (unadjusted) WIS. Of note, these values may not exactly match the relative WIS scores shown in the leaderboard table, because they are not adjusted for missing weeks or locations. The data are ordered on the x axis based on the relative WIS scores shown in the accuracy table, aggregated across horizons. The y axis is truncated at the 95th percentile of the sum of the bars across models, rounded up to the nearest 10.
In the following figures, we evaluate models across multiple forecasting weeks. Only models that submitted probabilistic forecasts for all 50 states are included. In the legend, models shown with a dot and line have scores for every week, while models shown with just a line are missing scores for at least one week.
For the figures, WIS is used as a metric, with the y axis truncated at the 97.5th percentile of the weekly average WIS. The first figure shows the mean WIS across all 50 states for submission weeks beginning October 14, 2023 at a 0-week horizon. The next 3 figures show the mean WIS aggregated across locations for 1-, 2- and 3-week horizons. The last 4 figures show the empirical 95% PI coverage aggregated across locations for all horizons.
In this figure, the models with dashed lines are not included in the FluSight ensemble.
In this figure, the models with dashed lines are not included in the FluSight ensemble.
In this figure, the dotted black line represents the average 1 week ahead error across all models, as a “point of reference”. This shows that the scale of errors increases with larger horizons. The models with dashed lines are not included in the FluSight ensemble.
In this figure, the dotted black line represents the average 1 week ahead error across all models, as a “point of reference”. This shows that the scale of errors increases with larger horizons. The models with dashed lines are not included in the FluSight ensemble.
The figures below show recent model performance stratified by location. We only included forecasts from the last 4 weeks. Models were included if they had submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. Locations are sorted by cumulative hospitalization counts.
The color scheme shows the WIS relative to the baseline, across all horizons. The only locations evaluated are the 50 states, selected jurisdictions, and the national level. The data are ordered on the x axis based on the relative WIS scores shown in the accuracy table, aggregated across horizons.
This figure shows the weekly number of confirmed influenza hospital admissions reported in the US. The vertical blue line indicates the beginning of the “recent” model evaluation period; “recent forecasts” refers to forecasts with reference dates falling within this period. The vertical green line indicates the beginning of the “seasonal” model evaluation period. For this report, “historical forecasts” refers to forecasts with reference dates falling within this period.